This dataset explores the factors contributing to keratoconus (KTCN). Keratoconus is a disorder of the eye that results in progressive thinning of the cornea, which may result in blurry vision, double vision, nearsightedness, astigmatism, and light sensitivity. This study performed comprehensive transcriptome profiling of human KTCN corneas using an RNA-Seq approach, the discovery analysis comparing eight KTCN (test condition) and eight non-KTCN (control) corneas.
This dataset was of interest to me given the complexity of disorders associated with vision. In the past, I have worked on research detailing the evolution of light sensitivity and specific proteins in mammalian lineages; as such, I was interested in seeing how the transcriptome can be affected in a cornea disease, and how that fits in with my existing knowledge on the subject.
First we install any necessary packages.
if (!require(BiocManager)){
install.packages("BiocManager")
}
## Loading required package: BiocManager
## Warning: package 'BiocManager' was built under R version 3.6.2
## Bioconductor version 3.10 (BiocManager 1.30.10), ?BiocManager::install for help
if (!require(GEOmetadb)){
BiocManager::install("GEOmetadb")
}
## Loading required package: GEOmetadb
## Loading required package: GEOquery
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 3.6.2
if (!require(biomaRt)){
BiocManager::install("biomaRt")
}
## Loading required package: biomaRt
## Warning: package 'biomaRt' was built under R version 3.6.3
if (!require(dplyr)){
install.packages("dplyr")
}
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:biomaRt':
##
## select
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if (!require(edgeR)){
BiocManager::install("edgeR")
}
## Loading required package: edgeR
## Warning: package 'edgeR' was built under R version 3.6.2
## Loading required package: limma
## Warning: package 'limma' was built under R version 3.6.2
##
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
First we download the dataset, if it isn’t already downloaded.
library(GEOmetadb)
if (!exists("sfiles")){
sfiles = getGEOSuppFiles('GSE77938')
# the second file is the discovery gene counts, contrary to the transcript counts or the replication data.
discoveryCounts = read.delim(rownames(sfiles)[2],header=TRUE)
head(discoveryCounts)
}
Compute overview statistics to assess data quality for the control and test conditions in your dataset.
dim(discoveryCounts)
## [1] 60387 17
We then map to HUGO identifiers. We will map it manually through biomaRt, and check if all the gene symbols are actually correct HGNC symbols rather than aliases. This will also find any missing values.
library(biomaRt)
library(dplyr)
conversion_stash <- "discounts_id_conversion.rds"
if(file.exists(conversion_stash)){
discounts_id_conversion <- readRDS(conversion_stash)
} else {
ensembl <- useMart("ensembl")
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
discounts_id_conversion <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
filters = c("ensembl_gene_id"),
values = discoveryCounts$GeneID,
mart = ensembl)
saveRDS(discounts_id_conversion, conversion_stash)
}
names(discoveryCounts)[names(discoveryCounts) == "GeneID"] <- "ensembl_gene_id"
discoveryCountsMapped <- dplyr::left_join(discoveryCounts, discounts_id_conversion, by=c("ensembl_gene_id"))
## Warning: Column `ensembl_gene_id` joining factor and character vector, coercing
## into character vector
We can check if there are any duplicates with gene counts, and then verify later with duplicated ().
summarized_gene_counts <- sort(table(discoveryCounts$ensembl_gene_id),decreasing = TRUE)
summarized_gene_counts[which(summarized_gene_counts>1)[1:10]]
##
## <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
##
First, we omit those that aren’t mapped to an HUGO symbol name.
length(which(is.na(discoveryCountsMapped$hgnc_symbol)))
## [1] 4004
discoveryCountsClean <- discoveryCountsMapped[which(!is.na(discoveryCountsMapped$hgnc_symbol)),]
4004 genes were omitted this way.
Next we check for multiple ensembl ID’s that map to the same HUGO symbol.
discoveryCountsClean <- discoveryCountsMapped[which(!is.na(discoveryCountsMapped$hgnc_symbol)),]
length((discoveryCountsClean$ensembl_gene_id)) - length(unique(discoveryCountsClean$ensembl_gene_id))
## [1] 3
There are three ensembl gene ID’s that are duplicated.
duplist <- discoveryCountsClean[duplicated(discoveryCountsClean$ensembl_gene_id),]
duplist
discoveryCountsClean[which(discoveryCountsClean$ensembl_gene_id%in%duplist$ensembl_gene_id),]
We can see that this is the same ensembl gene ID mapped to different hgnc symbols. These are likely outdated aliases of current symbols, or possibly deprecated ORFs. Because there are so few of these, we can look them up manually.
C12orf74 is a depreciated ORF, and so PLEKHG7 should be kept. LINC00595 is the correct current symbol, and LINC00856 is outdated. CCL3L3 is the correct current symbol, and CCL3L1 is outdated.
Note that the values are the same for each pair of rows, and thus this manipulation doesn’t impact our data. These don’t count as removed genes.
discoveryCountsClean <- discoveryCountsClean[!(discoveryCountsClean$hgnc_symbol%in%c("C12orf74", "LINC00856", "CCL3L1")),]
discoveryCountsClean[which(discoveryCountsClean$ensembl_gene_id%in%duplist$ensembl_gene_id),]
Now we can make the same analysis to find more than one ensembl ID mapped to the same hgnc_symbol.
duphgnc <- discoveryCountsClean[duplicated(discoveryCountsClean$hgnc_symbol),]
duphgnc
length(duphgnc)
## [1] 18
A lot of these are marked to a ‘blank’ HGNC symbol. We can remove these. 18 genes are removed this way.
discoveryCountsClean <- discoveryCountsClean[which(!(discoveryCountsClean$hgnc_symbol == "")),]
duphgnc <- discoveryCountsClean[duplicated(discoveryCountsClean$hgnc_symbol),]
duphgnc
There are 13 duplicate rows. These could be splice variants, or probes that are designed as internal controls. Do they have the same data?
discoveryCountsClean[which(discoveryCountsClean$hgnc_symbol%in%duphgnc$hgnc_symbol),]
Unlike the ensembl ID’s, these don’t have the same data. As such deleting them might influence our data. Instead we’ll append version ID’s to the HGNC symbols to be able to separate them and have unique HGNC symbols.
discoveryCountsClean[duplicated(discoveryCountsClean$hgnc_symbol),]$hgnc_symbol <-
paste(discoveryCountsClean[duplicated(discoveryCountsClean$hgnc_symbol),]$hgnc_symbol, ".2")
Now we have no more duplicates in our data:
length((discoveryCountsClean$ensembl_gene_id)) - length(unique(discoveryCountsClean$ensembl_gene_id))
## [1] 0
length((discoveryCountsClean$hgnc_symbol)) - length(unique(discoveryCountsClean$hgnc_symbol))
## [1] 0
A total of 4022 genes have been removed so far.
So now we can assign rownames to our data.
rownames(discoveryCountsClean) <- discoveryCountsClean$hgnc_symbol
Now, we have dealt with repeating mappings in both directions and removed any unmapped outliers. We proceed to normalizing the dataset. In edgeR, it is recommended to remove features without at least 1 read per million in n of the samples, where n is the size of the smallest group of replicates. There are 16 samples in the group, so n = 16.
cpms = cpm(discoveryCountsClean[,2:17])
rownames(cpms) <- discoveryCountsClean[,1]
# get rid of low counts
keep = rowSums(cpms >1) >=3
discoveryCountsFiltered = discoveryCountsClean[keep,]
discoveryCountsOutliers = discoveryCountsClean[!keep,]
nrow(discoveryCountsFiltered)
## [1] 15810
nrow(discoveryCountsOutliers)
## [1] 21428
As such we’ve removed 21 428 outliers due to low counts. What does this do to the data?
dim(discoveryCountsFiltered)
## [1] 15810 18
The final coverage of our dataset is 15 810 genes. We have a lot less genes covered now, but the dataset is of better quality.
data2plot <- log2(edgeR::cpm(discoveryCountsFiltered[,2:17]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "KRvsKC RNASeq Samples")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 2 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 3 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 4 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 6 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 7 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 8 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 9 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 10 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 11 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 12 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 13 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 14 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 15 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 16 is not drawn
Observe the data distribution by a density plot:
counts_density <- apply(log2(cpm(discoveryCountsFiltered[,2:17])), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="KCvsKR Data Before Normalization", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
As this is RNASeq data, for normalization we will apply Trimmed Mean of M-Values, or TMM. First we define the groups we want to divide by.
samples <- data.frame(lapply(colnames(discoveryCountsFiltered)[2:17],
FUN=function(x){unlist(strsplit(x, split = "_"))[c(1,2)]}))
colnames(samples) <- colnames(discoveryCountsFiltered)[2:17]
rownames(samples) <- c("cell_type","sample_num")
samples <- data.frame(t(samples))
Then we apply the normalisation.
filtered_data_matrix <- as.matrix(discoveryCountsFiltered[,2:17])
rownames(filtered_data_matrix) <- discoveryCountsFiltered$hgnc_symbol
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
d = calcNormFactors(d)
normalized_counts <- cpm(d)
Compare before and after of both plots.
counts_density <- apply(log2(cpm(normalized_counts)), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="KCvsKR Data After Normalization", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
The distribution of our data didn’t change significantly, which is good - we want to only remove slight technical variation, not biological variation.
What are the overall metrics of our data?
MDS:
plotMDS(d, labels=rownames(samples),
col = c("darkgreen","blue")[factor(samples$cell_type)])
The conditions themselves seem to cluster the most; there isn’t much clustering between different conditions implying an inherent similarity in a sample or a bias.
Estimating common and tagwise dispersion doesn’t seem to be entirely possible, because the samples for the two conditions are separately numbered in the discovery gene counts. To properly do this, we need to invoke the replicate count matrix. As such, this is omitted in this data preparation.
Code for normalization and data analysis of the dataset is mostly taken from BCB420 course materials.
Final dataset:
head(normalized_counts)
## KR_38 KR_50 KR_51 KR_53 KR_56 KR_57
## TSPAN6 17.6847346 17.931933 21.419169 19.8493200 18.199512 45.1612115
## DPM1 33.8829263 41.168775 33.111375 47.1734102 38.261798 42.2485340
## SCYL3 46.0008141 39.563929 41.151007 49.5086244 53.934790 49.5253243
## C1orf112 7.9248115 8.548038 8.248067 7.7770970 7.247688 6.7962475
## FGR 0.5843651 1.292793 1.806436 0.1772261 6.444768 0.4805428
## CFH 45.4369530 190.051740 61.994498 18.3793861 98.288072 136.2289662
## KR_61 KR_64 KC_35 KC_37 KC_39 KC_40
## TSPAN6 23.2198029 34.6554399 50.601788 46.56161510 20.369263 16.8256110
## DPM1 45.2930919 49.3238362 38.957243 47.17814772 43.513469 37.4793308
## SCYL3 47.8640626 56.2308579 49.530906 54.17445258 50.787363 57.3425856
## C1orf112 11.2682635 10.4567977 8.119991 7.63964324 6.329232 9.1806857
## FGR 0.3937523 0.3730273 0.124763 0.02680577 0.129891 0.2484318
## CFH 102.6303709 46.7848439 52.223707 43.92571480 24.443116 9.1806857
## KC_43 KC_45 KC_47 KC_52
## TSPAN6 33.65623349 20.6153992 32.9678451 34.22118135
## DPM1 39.73688496 38.2531392 41.7541895 41.61720101
## SCYL3 48.59923333 51.2505126 48.1561356 50.95507375
## C1orf112 6.21858686 7.4550152 7.1967456 6.41978751
## FGR 0.01149462 0.2390821 0.1415396 0.09550097
## CFH 49.69122178 23.8647425 38.6294301 49.07688798
#conversion_stash <- "ncounts.rds"
#saveRDS(normalized_counts, conversion_stash)
# discounts_id_conversion <- readRDS(conversion_stash)
In the previous assignment, we had sourced, cleaned, and normalized dataset GSE77938, which can be downloaded from GEO. This dataset explores the factors contributing to keratoconus (KTCN). Keratoconus is a disorder of the eye that results in progressive thinning of the cornea, which may result in blurry vision, double vision, nearsightedness, astigmatism, and light sensitivity. This study performed comprehensive transcriptome profiling of human KTCN corneas using an RNA-Seq approach, the discovery analysis comparing eight KTCN (test condition) and eight non-KTCN (control) corneas (Kabza et al. 2017).
Here, we aim to determine pathways associated with significantly up- and down-reguated genes in KTCN corneas. First, we load the normalized data.(It is necessary to run expr_cleanup.Rmd first.)
normalized_counts <- readRDS("ncounts.rds")
head(normalized_counts)
## KR_38 KR_50 KR_51 KR_53 KR_56 KR_57
## TSPAN6 17.6847346 17.931933 21.419169 19.8493200 18.199512 45.1612115
## DPM1 33.8829263 41.168775 33.111375 47.1734102 38.261798 42.2485340
## SCYL3 46.0008141 39.563929 41.151007 49.5086244 53.934790 49.5253243
## C1orf112 7.9248115 8.548038 8.248067 7.7770970 7.247688 6.7962475
## FGR 0.5843651 1.292793 1.806436 0.1772261 6.444768 0.4805428
## CFH 45.4369530 190.051740 61.994498 18.3793861 98.288072 136.2289662
## KR_61 KR_64 KC_35 KC_37 KC_39 KC_40
## TSPAN6 23.2198029 34.6554399 50.601788 46.56161510 20.369263 16.8256110
## DPM1 45.2930919 49.3238362 38.957243 47.17814772 43.513469 37.4793308
## SCYL3 47.8640626 56.2308579 49.530906 54.17445258 50.787363 57.3425856
## C1orf112 11.2682635 10.4567977 8.119991 7.63964324 6.329232 9.1806857
## FGR 0.3937523 0.3730273 0.124763 0.02680577 0.129891 0.2484318
## CFH 102.6303709 46.7848439 52.223707 43.92571480 24.443116 9.1806857
## KC_43 KC_45 KC_47 KC_52
## TSPAN6 33.65623349 20.6153992 32.9678451 34.22118135
## DPM1 39.73688496 38.2531392 41.7541895 41.61720101
## SCYL3 48.59923333 51.2505126 48.1561356 50.95507375
## C1orf112 6.21858686 7.4550152 7.1967456 6.41978751
## FGR 0.01149462 0.2390821 0.1415396 0.09550097
## CFH 49.69122178 23.8647425 38.6294301 49.07688798
First, we conduct differential expression analysis with this normalized expression set. Based on previously generated MDS, the conditions themselves seem to cluster more than any other factors; and indeed no other factors are noted in the dataset. As such, condition is the only factor we will consider in our model.
samples <- data.frame(
lapply(colnames(normalized_counts),
FUN=function(x){
unlist(strsplit(x, split = "_"))[1]}))
colnames(samples) <- colnames(normalized_counts)
rownames(samples) <- c("cell_type")
samples <- data.frame(t(samples))
head(samples)
We then create a design matrix.
model_design <- model.matrix(~ samples$cell_type)
head(model_design)
## (Intercept) samples$cell_typeKR
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 1 1
This is inverted from the ‘positive’ and ‘negative’ conditions in the paper (KC is the set of interest (KTCN), KR is the control (non-KTCN).) Thus we reverse it before using the model.
model_design[,2] <- abs(model_design[,2]-1)
head(model_design)
## (Intercept) samples$cell_typeKR
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
The column name is still “celltypeKR”, but that’s alright (has no bearing on the analysis).
Create data matrix and fit the data to the above model.
expressionMatrix <- as.matrix(normalized_counts)
rownames(expressionMatrix) <- rownames(normalized_counts)
colnames(expressionMatrix) <- colnames(normalized_counts)
minimalSet <- ExpressionSet(assayData=expressionMatrix)
fit <- lmFit(minimalSet, model_design)
Apply empircal Bayes to compute differential expression for the above described model. Use trend=TRUE as this is RNASeq Data. We used the Benjamini-Hochberg method for mutiple hypothesis correction. This was chosen because BH is specifically optimized to correct for false discovery rate. This is of particular importance in clinical samples, as a falsely enriched gene might point towards a non-existent potential drug target for the disease. However, given the sample size which isn’t too large, Bonferroni and other methods would be too strict, not allowing potentially ‘true’ discoveries to make it through the correction filter.
fit2 <- eBayes(fit,trend=TRUE)
#edgeR --
topfit <- topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(rownames(normalized_counts),
topfit,
by.y=0,by.x=1,
all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
head(output_hits)
How many gene pass the threshold p-value < 0.05? How many pass the correction?
length(which(output_hits$P.Value < 0.05))
## [1] 4654
length(which(output_hits$adj.P.Val < 0.05))
## [1] 1001
We can show the overall Volcano plot below. The top gene highlighted is BPTF, a transcription factor (the role of transcription factors in KTCN being discussed later in this report).
We can also visualize the top hits using a heatmap. As we can see below, the conditions do cluster together. This is reassuring, as it shows that the expression differences in our top hits are truly associated with the disease phenotype.
With our significantly up-regulated and down-regulated set of genes, we can run a thresholded gene set enrichment analysis. A positive log Fold Change indicates an upregulation in the disease state, and a negative a downregulation in the disease state. Here we can see that we have 2562 upregulated genes, and 2092 downregulated genes.
length(which(output_hits$P.Value < 0.05 & output_hits$logFC > 0))
## [1] 2092
length(which(output_hits$P.Value < 0.05 & output_hits$logFC < 0))
## [1] 2562
Thresholded list:
output_hits_withgn <- merge(rownames(normalized_counts),output_hits)
output_hits_withgn[,"rank"] <- -log(output_hits_withgn$P.Value,base =10) * sign(output_hits_withgn$logFC)
output_hits_withgn <- output_hits_withgn[order(output_hits_withgn$rank),]
upregulated_genes <- output_hits_withgn$x[
which(output_hits_withgn$P.Value < 0.05
& output_hits_withgn$logFC > 0)]
downregulated_genes <- output_hits_withgn$x[
which(output_hits_withgn$P.Value < 0.05
& output_hits_withgn$logFC < 0)]
write.table(x=upregulated_genes,
file="upregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file="downregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
all_genes <- output_hits_withgn$x
I used gProfiler for the thresholded enrichment analysis. This was because it is a common tool, drawing from multiple annotation sources mentioned in class (ex. KEGG, GO, Reactome) and it has a comfortable R interface. This draws the latest versions from each annotation source separately (as of March 3, 2020).
Both when the analysis is run using up-regulated and all genes, the top results are transcription factors (ordered gene lists are used). There does not seem to be a large difference in the type of transcription factors that are shown as enriched using each method. However, it should be noted that the original publication performed overrepresentation analysis on upregulated and downregulated genes separately, so analysing the pathways in this manner is consistent with the way the authors approached the problem.
enrichment <- gost(all_genes, ordered_query = TRUE)
significant_pathways <- enrichment$result$term_name[order(enrichment$result$p_value)]
head(significant_pathways)
## [1] "Factor: ER81; motif: RCCGGAARYN; match class: 1"
## [2] "Factor: Elk-1; motif: NRSCGGAAGNN; match class: 1"
## [3] "Factor: Elf-1; motif: NANCCGGAAGTN; match class: 1"
## [4] "Factor: E2F; motif: GGCGSG; match class: 1"
## [5] "Factor: ELK-1; motif: ACCGGAAGTN; match class: 1"
## [6] "Factor: Elf-1; motif: NAMCCGGAAGTN; match class: 1"
write.table(enrichment$result[,1:13], file="enrichment_result.txt", sep="/t")
How many genes overall with a threshold of P < 0.05?
length(enrichment$result$term_name[which(enrichment$result$p_value < 0.05)])
## [1] 1171
We can also observe where most of these annotations came from and what categories they fall into (most are transcription factors, which is discussed below).
With only up-regulated genes:
enrichment_up <- gost(upregulated_genes, ordered_query = TRUE)
significant_pathways_up <- enrichment_up$result$term_name[order(enrichment_up$result$p_value)]
head(significant_pathways_up)
## [1] "Factor: GKLF; motif: NNCCMCRCCCN; match class: 1"
## [2] "Factor: Sp1; motif: NGGGGGCGGGGCCNGGGGGGGG; match class: 1"
## [3] "Factor: Sp1; motif: NWRGCCACGCCCMCN; match class: 1"
## [4] "Factor: ETF; motif: GVGGMGG; match class: 1"
## [5] "Factor: SP4; motif: SCCCCGCCCCS; match class: 0"
## [6] "Factor: SP4; motif: SCCCCGCCCCS"
Only down-regulated genes:
enrichment_dwn <- gost(downregulated_genes)
significant_pathways_dwn <- enrichment_dwn$result$term_name[order(enrichment_dwn$result$p_value)]
head(significant_pathways_dwn)
## [1] "extracellular matrix organization"
## [2] "extracellular structure organization"
## [3] "biological adhesion"
## [4] "cell adhesion"
## [5] "focal adhesion"
## [6] "cell-substrate junction"
The main up-regulated factors are ETF, SP4, and other transcription factors: overall, what we see is an upregulation in DNA binding transcription factor actiity. Among these are genes associated with retinal diseases - such as E2F, which is known to be upregulated in retinoblasoma as it is associated with Cdk6 and the Ras pathway. This is consistent with the publication, which also specifically mentions upregulation of CREB phosphorylation through the activation of Ras. Other genes are also among those highly enriched, such as the GRIN gene family (ex. GRIN2B), which are also cited in the paper as upregulated, as part of the CREB phosphorylation and as part of the activation of the NMDA receptor upon glutamate binding and postsynaptic events.
The down-regulated pathways present a somewhat clearer and more coherent picture. The main down-regulated pathways are adherens-junction related: “anchoring junction”, “adherens junction”, “extracellular matrix organization” etc. This is supported in the paper as well, as extracellular matrix organization was for the authors also a main down-regulated pathway. The authors state that lower expression of nearly all genes involved in collagen maturation and the ECM stability was observed, and this is replicated in this analysis. This makes sense, as it is known that there are pronounced abnormalities in the extracellular matrix in keratoconus patients (???).
In the previous assignments, we had sourced, cleaned, and normalized dataset GSE77938.
This dataset explores the factors contributing to keratoconus (KTCN). Keratoconus is a disorder of the eye that results in progressive thinning of the cornea, which may result in blurry vision, double vision, nearsightedness, astigmatism, and light sensitivity. This study performed comprehensive transcriptome profiling of human KTCN corneas using an RNA-Seq approach, the discovery analysis comparing eight KTCN (test condition) and eight non-KTCN (control) corneas (Kabza et al. 2017).
Here, we aim to determine pathways associated with significantly up- and down-reguated genes in KTCN corneas. First, we normalized the data. Then, we determined pathways associated with significantly up- and down-regulated genes in KTCN corneas by differential expression analysis.
With our significantly up-regulated and down-regulated set of genes, we ran a thresholded gene set enrichment analysis. This dataset has 2562 upregulated genes, and 2092 downregulated genes. Now, we can run non-thresholded pathway analysis, compare those results to our thresholded pathway analysis, and analyse the pathways associated with this geneset, as well as dark matter present in this dataset.
Non-thresholded pathway analysis was run using GSEA version 4.0.3 (Subramanian 2005). An analysis was run on a pre-ranked gene list with a maximum pathway size of 200 and a minimum pathway size of 15.
The geneset used was obtained from the Bader Lab website, and contained all human pathways without GO information. The version pulled was that current to April 1, 2020. We download it below (Merico et al. 2010).
gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
# list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# get the gmt that has all the pathways and does not include terms inferred from
# electronic annotations(IEA) start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)", contents,
perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path(gmt_file)
download.file(paste(gmt_url, gmt_file, sep = ""), destfile = dest_gmt_file)
To get the pre-ranked gene list for GSEA, we use our enrichment scores from Assignment 2, and convert that to a .rnk file.
en_data <- output_hits_withgn
en_data <- en_data[,c(1,8)]
write.table(en_data,file="expression_rank.rnk",quote=F,sep="\t",row.names=F)
head(en_data)
We then ran the analysis with GSEA.
The top pathways upregulated in KTCN were, summarised:
Using thresholded analysis, we had found that the main up-regulated factors are ETF, SP4, and other transcription factors: overall, what we see is an upregulation in DNA binding transcription factor actiity. This was consistent with the publication, which also specifically mentions upregulation of CREB phosphorylation through the activation of Ras.
This is consistent with the results from the non-thresholded pathway analysis: most or all of the top pathways are associated with RNA Polymerase I, the main transcription enzyme, or RNA processing of some form, which obviously are all processes tightly related to the transcription process. Combining these two analyses, we can more confidently say that KTCN patients have an upregulation of transcription-related processes.
The top pathways downregulated in KTCN were, summarised:
In the thresholded analysis, we had stated that the main down-regulated pathways are adherens-junction related: “anchoring junction”, “adherens junction”, “extracellular matrix organization” etc. This is supported in the paper as well, as extracellular matrix organization was for the authors also a main down-regulated pathway. Here, we see more of a focus on the downregulation of translation-related mechanisms. This was not noted in the original paper (Kabza et al. 2017).
Using the results from the non-thresholded gene set enrichment analysis, the results were visualized in Cytoscape (Shannon et al. 2003) using EnrichmentMap(Merico et al. 2010). The resulting map was generated using an FDR cutoff of 0.05.
The resulting network was very, very large and hard to summarise or interpret. This is a screenshot of the network before summary or manual layout.
Due to the very large number of nodes, a reasonable understanding of the pathways up- and down- regulated in KTCN could only be obtained by excluding smaller clusters (bottom 20%) and thematically summarising clusters. Annotation was done using AutoAnnotate, with default parameters.
One of the main themes is cardiac development morphogenesis, along with themes relating to actin filament bundle upregulation and the upregulation of the bmp pathway. The actin dysregulation is consistent with the original paper, which noted the disruption of multiple pathways relating to the cytoskeleton and adherens junctions. It is surprising, however, to see such an upregulation of genes relating to cardiac morphogenesis, given that KTCN is a corneal disease. However, the disruption of adherens junctions would also, in other cells, lead to effects on cell morphogenesis. It is likely that the cardiac pathways are very well-annotated due to their medical importance, and that this dominant theme expands upon the original finding of the adherens junction disruption. This is also supported by findings in the literature. (Sheikh, Ross, and Chen 2009) Specifically, a commonality between both themes are intercalated disks (ICDs), which are highly organized cell-cell adhesion structures and connect cardiomyocytes to one another. Furthermore, studies have shown that adherens junctions organize independently of gap junctions and other ICD components, which makes it more likely that this is a true hit and that issues in adherens junctions can influence the heart(Gutstein et al. 2010) Human genetics and mouse models have revealed that mutations and/or deficiencies in various ICD components can lead to cardiomyopathies and arrhythmias. As such, this analysis provides the interesting development that this weakness in adherens junctions might cause issues elsewhere in the body as well, and that this is a topic that should be investigated.
Dark matter genes are genes that are significantly differentially expressed in the model, but not annotated to any pathways. We want to analyse these genes in our dataset. The original paper for this dataset focuses on the disruption of three pathways relating to the adherens junction. By focusing on dark matter genes, we might discover other, less noticed impacts of KTCN.
We follow the following steps:
library(GSA)
gmt_file <- file.path("Human_GOBP_AllPathways_no_GO_iea_April_01_2020_symbol.gmt")
capture.output(genesets<- GSA.read.gmt(gmt_file),file="gsa_load.out")
names(genesets$genesets) <- genesets$geneset.names
expression <- rownames(expressionMatrix)
ranks <- read.table(file.path(getwd(), "expression.rnk"),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE)
#get all the GSEA directories
gsea_directories <- list.files(path = file.path(getwd(),"gsea"),
pattern = "\\.GseaPreranked")
if(length(gsea_directories) == 1){
gsea_dir <- file.path(getwd(),"gsea",gsea_directories[1])
#get the gsea result files
gsea_results_files <- list.files(path = gsea_dir,
pattern = "gsea_report_*.*.xls")
#there should be 2 gsea results files
enr_file1 <- read.table(file.path(gsea_dir,gsea_results_files[1]),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
enr_file2 <- read.table(file.path(gsea_dir,gsea_results_files[1]),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
}
Collect the Data we need to calculate the dark matter from the above files:
FDR_threshold <- 0.001
#get the genes from the set of enriched pathways (no matter what threshold)
all_sig_enr_genesets<- c(rownames(enr_file1)[which(enr_file1[,"FDR.q.val"]<=FDR_threshold)], rownames(enr_file2)[which(enr_file2[,"FDR.q.val"]<=FDR_threshold)])
genes_sig_enr_gs <- c()
for(i in 1:length(all_sig_enr_genesets)){
current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% all_sig_enr_genesets[i])])
genes_sig_enr_gs <- union(genes_sig_enr_gs, current_geneset)
}
genes_all_gs <- unique(unlist(genesets$genesets))
We can see that there is a large region of overlap between the genes that are differentially expressed, and those genes that are not enriched in our model.
We look at some of the top ranked dark matter genes.
genes_no_annotation <- setdiff(expression, genes_all_gs)
ranked_gene_no_annotation <- ranks[which(ranks[,1] %in% genes_no_annotation),]
ranked_gene_no_annotation[1:10,]
The highest-scoring gene is SLFN12, associated with Syndromic X-Linked Intellectual Disability Turner Type, which may indicate a potential connection. Many of the other genes on this list, however, have uncertainties in their sequences, or cautions regarding their initiation or transcription. It is likely for this reason that they are not enriched in particular pathways, as they simply haven’t been annotated.
We can look at a heatmap of any significant genes that are not annotated to any pathways in entire set of pathways used for the analysis.
top_hits <- output_hits[which(output_hits$x%in%ranked_gene_no_annotation$x),]$x[1:100]
heatmap_matrix <- normalized_counts
rownames(heatmap_matrix) <- rownames(normalized_counts)
colnames(heatmap_matrix) <- colnames(normalized_counts)
heatmap_matrix <- t(scale(t(heatmap_matrix)))
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
name = "DE",
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
current_heatmap
Compare this to a heatmap of the genes that are not annotated to any of the pathways returned in the enrichment analysis, but might be annotated to SOME pathway:
genes_no_annotation <- setdiff(expression, genes_sig_enr_gs)
ranked_gene_no_annotation <- ranks[which(ranks[,1] %in% genes_no_annotation),]
top_hits <- output_hits[which(output_hits$x%in%ranked_gene_no_annotation$x),]$x[1:100]
heatmap_matrix <- normalized_counts
rownames(heatmap_matrix) <- rownames(normalized_counts)
colnames(heatmap_matrix) <- colnames(normalized_counts)
heatmap_matrix <- t(scale(t(heatmap_matrix)))
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
name = "DE",
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
current_heatmap
As we can see, both heatmaps are very similar. It is thus likely that this dark matter is due to an annotation issue, and not genes present in pathways that are not shown as enriched.
Packages used:
limma Biobase ComplexHeatmap circlize knitr VennDiagram GSA
Gutstein, DE, FY Liu, MB Meyers, A Choo, and GI Fishman. 2010. “The Organization of Adherens Junctions and Desmosomes at the Cardiac Intercalated Disc Is Independent of Gap Junctions.” Plos One 5 (11).
Kabza, Michal, Justyna Karolak, Malgorzata Rydzanicz, Michal Szczesniak, Dorota Nowak, Barbara Ginter-Matuszewska, Piotr Polakowski, Rafal Ploski, Jacek Szaflik, and Marzena Gajecka. 2017. “Collagen Synthesis Disruption and Downregulation of Core Elements of Tgf-??, Hippo, and Wnt Pathways in Keratoconus Corneas.” European Journal of Human Genetics, no. 25: 582–90.
Merico, Daniele, Ruth Isserlin, Oliver Stueker, Andrew Emili, and Gary D Bader. 2010. “Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation.” Plos One 5 (11).
Shannon, P, A Markiel, O Ozier, NS Baliga, JT Wang, D Ramage, N Amin, B Schwikowski, and T Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Res 13 (11): 2498–2504.
Sheikh, Farah, Robert Ross, and Ju Chen. 2009. “Cell-Cell Connection to Cardiac Disease.” Trends Cardiovasc Med 19 (6): 182–90.
Subramanian, Tamayo et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proc Natl Acad Sci U S A 102 (43): 15545–50.